Sydney House Prices


House prices in Sydney have been the subject of great attention in Australia and globally, specifically for being extraordinarily high. Being a resident of Sydney, I was interested in seeing the relative prices across the different suburbs I live around, and I wanted a way to visualise these geospatial relationships myself.

A choropleth map (from Greek χῶρος choros 'area/region' and πλῆθος plethos 'multitude') is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income.

Gathering Data

The data I used can be found here. I use a YAML file for configuration parameters, mainly to keep the code readable and to make parameters easier to tweak in the future. A YAML file looks similar to a dictionary, with key-value pairs.

YAML (a recursive acronym for "YAML Ain't Markup Language") is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted.

Inspecting and Cleaning the Data

As with all data science tasks, we want to inspect our data and gather some elementary information about the features, labels and variable types.


There are 199,504 entries, spanning from 2011-04-16 to 2019-06-19. Quite a large number of data points here. There are 8 columns (zero-indexed) of various types.

The sellPrice and propType columns should be changed to more appropriate types: sellPrice to float, as selling prices are continuous variables, and propType to category, as property types are... categorical. This will make pandas methods more informative.
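The type conversion can be sketched as follows. This is a minimal example on a made-up three-row dataframe; only the column names sellPrice and propType come from the dataset, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented.
df = pd.DataFrame({
    "sellPrice": [1210000, 2250000, 480000],
    "propType": ["house", "house", "townhouse"],
})

# Selling prices are continuous, so store them as floats.
df["sellPrice"] = df["sellPrice"].astype("float64")
# Property types come from a small fixed set, so store them as a category.
df["propType"] = df["propType"].astype("category")

print(df.dtypes)
```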

Great! Our values now are of a more appropriate type. Now to continue inspecting the dataframe.

There are a few columns that are redundant for our analyses. We can remove the id and postalCode columns.

To quickly get an overview of our data and its statistical properties, we can use the describe method on the dataframe. We transpose the result to make viewing it easier.
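Dropping the two columns and producing the transposed summary might look like the sketch below. The dataframe is a toy stand-in with invented values, and the bed column is an assumption about the dataset's schema.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "postalCode": [2154, 2017, 2027],
    "sellPrice": [1210000.0, 880000.0, 9500000.0],
    "bed": [4, 2, 5],  # assumed column name
})

# Remove the columns that add nothing to the analysis.
df = df.drop(columns=["id", "postalCode"])

# describe() returns one column per feature; transposing gives one row per feature.
summary = df.describe().T
print(summary)
```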

Immediately some interesting properties stand out.

There is some work to do dealing with the outliers in this dataset.

We can begin by removing the max value in our dataset and then check how this has impacted our summary statistics.

Supposedly this property was sold in 2010, in Zetland, with 99 bedrooms and 41 car spaces. Even if this was a big property development (block of units, skyscraper) this price does not make sense. For further context the GDP of Estonia is around the same as this outlier.

Let's drop this value.
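One way to locate and drop the row holding the maximum price is sketched below. The toy dataframe and the suburb column name are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in; the last row mimics an implausibly large sale price.
df = pd.DataFrame({
    "suburb": ["Castle Hill", "Point Piper", "Zetland"],  # assumed column name
    "sellPrice": [900000.0, 1500000.0, 20.7e9],
})

# idxmax returns the index label of the row with the largest price.
outlier_idx = df["sellPrice"].idxmax()
print(df.loc[outlier_idx])

# Drop that single row and keep everything else.
df = df.drop(index=outlier_idx)
```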

Let's double check our change.

Whoa! Our standard deviation has dropped significantly, as expected, by around $5.5m$. This is important, as any inferences or analyses would have been quite off the mark if we had included our Estonia-priced property.

Let us continue by addressing the lower range of our dataset. I believe a sensible amount for the lower range would be property prices greater than $\$10,000$.
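The lower cut-off can be applied with a boolean mask. The sketch below uses invented prices; only the $10,000 threshold comes from the text.

```python
import pandas as pd

# Toy stand-in with a few implausibly low prices.
df = pd.DataFrame({"sellPrice": [1.0, 9500.0, 10000.0, 650000.0, 1200000.0]})

MIN_PRICE = 10_000  # threshold chosen in the text

# Keep only sales strictly greater than the threshold.
df = df[df["sellPrice"] > MIN_PRICE]
print(df)
```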

After some preliminary and elementary data preprocessing we can now explore our data and find answers to some interesting questions. Maybe we can begin with:

Exploratory Data Analysis

Which 10 suburbs sold the most properties?
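A count of sales per suburb answers this question. The sketch below assumes the suburb names live in a column called suburb and uses invented rows.

```python
import pandas as pd

# Toy stand-in; one row per sale.
df = pd.DataFrame({
    "suburb": ["Castle Hill", "Zetland", "Castle Hill",
               "Point Piper", "Zetland", "Castle Hill"],
})

# value_counts sorts by frequency in descending order; head(10) keeps the top 10.
top_suburbs = df["suburb"].value_counts().head(10)
print(top_suburbs)
```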


Interesting. Castle Hill, located 30 kilometres north-west of the Sydney central business district and 9.5 kilometres north of Parramatta, tops the list. It is within the Hills District region, split between the local government areas of The Hills Shire and Hornsby Shire. Castle Hill residents have a personal income that is 18.9% greater than the median national income, according to the 2016 Census. This suggests that Castle Hill may be of interest to property analysts.

Another interesting metric, which will later be used for the choropleth map, is the median house price for each suburb (a statistic that is far less sensitive to outliers than the mean).
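The per-suburb median can be computed with a groupby. In the sketch below the suburb column name and all prices are assumptions; the Zetland rows include one extreme value to show that the median barely moves while the mean does.

```python
import pandas as pd

# Toy stand-in; the prices are invented.
df = pd.DataFrame({
    "suburb": ["Zetland", "Zetland", "Zetland", "Point Piper", "Point Piper"],
    "sellPrice": [1.0e6, 1.2e6, 20.7e9, 5.0e6, 7.0e6],
})

# Median sale price per suburb, highest first.
median_price = (
    df.groupby("suburb")["sellPrice"]
    .median()
    .sort_values(ascending=False)
)
print(median_price)

# For comparison: the mean for Zetland is dominated by the extreme value.
print(df.groupby("suburb")["sellPrice"].mean())
```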

Out of interest, the large outlier which was the 20.7b property, was located in Zetland, where the median house price is 1.130m. More evidence that this value was bonkers.

What are the suburbs with the highest median prices in Sydney?

Unsurprisingly, Point Piper has the highest median selling price. Next on the list is Collaroy Beach, another coastal suburb, this time in the Northern Beaches. The old adage that coastal properties house the elite seems to hold up against these high selling prices.

The next step to get a better feel for our data is to visualise some relationships.

Visualisations

Our data has a temporal dimension: each sale has a timestamp attached to it. An interesting insight may be which months were the most popular for property sales around Sydney.

As we can see, January is the month in which the fewest properties were sold, and March is the month in which the most were sold.
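The month-by-month count behind this kind of plot can be sketched as below. The Date column name and the sample dates are assumptions made for illustration, so the toy result is not the real ranking.

```python
import pandas as pd

# Toy stand-in; the dates are invented.
df = pd.DataFrame({
    "Date": ["2018-01-15", "2018-03-02", "2018-03-20", "2019-03-05", "2019-06-19"],
})

# Parse the strings into timestamps so the .dt accessor is available.
df["Date"] = pd.to_datetime(df["Date"])

# Count sales per calendar month (1 = January, ..., 12 = December).
sales_per_month = df["Date"].dt.month.value_counts().sort_index()
busiest_month = sales_per_month.idxmax()
print(sales_per_month)
```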

Distributions

What are the distribution shapes of our features and labels?

Plotly Histograms

Let's produce the histograms that we will use for our dashboard.

Cumulative Distribution Functions

Cumulative distribution functions are another great way to get a feel for our data. Histograms can be misleading depending on the number of bins we use. This is called binning bias, and cumulative distributions are a great way to get a cleaner picture for analysis.
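An empirical cumulative distribution function needs no bins at all: sort the values and plot each one against the fraction of observations at or below it. The helper below is a minimal sketch, and the name ecdf is my own rather than something taken from the original code.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative fraction at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Invented prices, in no particular order.
x, y = ecdf([850000, 620000, 1400000, 990000])
for price, frac in zip(x, y):
    print(f"{price:>10.0f}  {frac:.2f}")
```

Plotting y against x as points or a step line gives the cumulative distribution.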

Side Note: A rule of thumb if you are ever unsure of the bin size is the square root rule.
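The square root rule sets the number of bins to roughly the square root of the number of observations. The sketch below rounds up, which is one common convention; rounding to the nearest integer is also used.

```python
import math

def sqrt_rule_bins(n_observations):
    """Number of histogram bins under the square root rule (rounded up)."""
    return math.ceil(math.sqrt(n_observations))

# With the 199,504 entries reported for the raw dataset:
print(sqrt_rule_bins(199_504))
```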

Correlation Heatmap

Our correlation heatmap indicates that the linear relationships between our variables of interest are very weak. The only strong relationship is between bed and baths.
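The matrix behind a heatmap like this is the pairwise Pearson correlation of the numeric columns. The sketch below uses invented values, and the bed, bath and car column names are assumptions about the dataset.

```python
import pandas as pd

# Toy stand-in with numeric columns only; the values are invented.
df = pd.DataFrame({
    "bed": [2, 3, 4, 5, 3],
    "bath": [1, 2, 2, 3, 1],
    "car": [1, 1, 2, 2, 0],
    "sellPrice": [700000.0, 950000.0, 900000.0, 2100000.0, 1200000.0],
})

# DataFrame.corr computes Pearson correlation by default.
corr = df.corr()
print(corr.round(2))
```

The resulting square matrix can be passed directly to a heatmap plotting function.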

Scatter Plots

We can use a scatter plot to confirm the conclusion we made with the correlation heatmap. As we can see, bed and baths have a somewhat positive linear relationship.

Boxplot

A boxplot is a useful way to compare continuous values across different categorical types. Here we can analyse differences between different property types in Sydney.

Construction of Choropleth Map

To create a choropleth map using plotly we need a couple of things:

We imported a useful helper function that grabs the relevant JSON from the internet and saves it as a geopandas dataframe. This will be useful when merging the data.

We can see that nsw_loca_2 is the most obvious key to match on. The first column holds the geometry parameters that plotly will use to plot our map. However, one issue still exists: we need to make sure the strings in nsw_loca_2 are formatted in the same way as in the median statistics dataframe. From a quick inspection, that dataframe has suburbs in a proper noun format (each word capitalised). We will change the geopandas df to reflect this.
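Assuming the suburb names in nsw_loca_2 arrive in upper case, the reformatting could be done with the pandas string accessor. The sketch uses a plain pandas dataframe as a stand-in for the geopandas one, since the string methods work the same way on both.

```python
import pandas as pd

# Stand-in for the geopandas dataframe; only the key column is shown.
geo_df = pd.DataFrame({"nsw_loca_2": ["CASTLE HILL", "POINT PIPER", "ZETLAND"]})

# str.title() capitalises the first letter of each word and lowercases the rest.
geo_df["nsw_loca_2"] = geo_df["nsw_loca_2"].str.title()
print(geo_df)
```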

Awesome! Now we can merge our two dataframes. We want to do an inner merge, which keeps only the rows whose key appears in both dataframes.
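An inner merge on the suburb name might look like the sketch below. Both dataframes are toy stand-ins, and the suburb and medianPrice column names are assumptions; only nsw_loca_2 comes from the text.

```python
import pandas as pd

# Stand-in for the geopandas dataframe (geometry replaced by a placeholder string).
geo_df = pd.DataFrame({
    "geometry": ["poly_a", "poly_b", "poly_c"],
    "nsw_loca_2": ["Castle Hill", "Point Piper", "Bondi"],
})

# Stand-in for the median statistics dataframe.
median_df = pd.DataFrame({
    "suburb": ["Castle Hill", "Point Piper", "Zetland"],
    "medianPrice": [1200000.0, 6000000.0, 1130000.0],
})

# how="inner" keeps only suburbs present in both dataframes.
merged = geo_df.merge(median_df, how="inner", left_on="nsw_loca_2", right_on="suburb")
print(merged)
```

Suburbs that appear in only one of the two inputs (Bondi and Zetland here) are dropped from the result.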

We will need a mapbox token to use the style specified below. If this isn't found in the environment variable MAPBOX_TOKEN, we will check the root of the repository for .mapbox_token, and if all else fails we can use carto-darkmatter, which doesn't require an API key.
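The fallback order described above can be sketched with the standard library alone. The function name resolve_map_style and the preferred style "dark" are placeholders of my own; substitute whichever token-requiring style the map actually uses.

```python
import os
from pathlib import Path

def resolve_map_style(token_file=Path(".mapbox_token"), preferred_style="dark"):
    """Return (style, token), falling back to a token-free style if needed."""
    # 1. Look for the token in the environment.
    token = os.environ.get("MAPBOX_TOKEN")
    # 2. Otherwise look for a token file (by default in the current directory).
    if not token and token_file.is_file():
        token = token_file.read_text().strip()
    if token:
        return preferred_style, token
    # 3. No token found: use a style that does not need one.
    return "carto-darkmatter", None

style, token = resolve_map_style()
print(style)
```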

Linear Regression

Let's quickly run a linear regression to see if we can come up with a meaningful model. Note that in our preliminary analysis the correlations were low, which suggests that a linear model is a poor fit and that a higher-order model may be more appropriate. We can expect our model to perform badly, but let's check it out anyway.

As expected, our model doesn't explain any meaningful variation and performs worse on our test set, suggesting that the model is not robust either. Bummer.
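To show the shape of such an experiment, here is a minimal sketch using ordinary least squares from NumPy with a train/test split and an R² score. It is not necessarily the method used for the results in this post, and the data is synthetic: the features are random stand-ins for columns such as bed, bath and car, and the prices are generated independently of them.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 200
X = rng.integers(1, 6, size=(n, 3)).astype(float)  # synthetic features
y = rng.lognormal(mean=13.5, sigma=0.6, size=n)     # synthetic prices, unrelated to X

# Shuffle the rows and hold out the last 50 as a test set.
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Add an intercept column and solve the least-squares problem on the training rows.
A_train = np.column_stack([np.ones(len(train)), X[train]])
A_test = np.column_stack([np.ones(len(test)), X[test]])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

r2_train = r_squared(y[train], A_train @ coef)
r2_test = r_squared(y[test], A_test @ coef)
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

Comparing the two scores is the robustness check: a test score well below the training score indicates the fit does not generalise.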